On the Usefulness of Synthetic Tabular Data Generation
Despite recent advances in synthetic data generation, the scientific
community still lacks a unified consensus on its usefulness. It is commonly
believed that synthetic data can be used for both data exchange and boosting
machine learning (ML) training. Privacy-preserving synthetic data generation
can accelerate data exchange for downstream tasks, but there is not enough
evidence to show how or why synthetic data can boost ML training. In this
study, we benchmarked ML performance using synthetic tabular data for four use
cases: data sharing, data augmentation, class balancing, and data
summarization. We observed marginal improvements for the balancing use case on
some datasets. However, we conclude that there is not enough evidence to claim
that synthetic tabular data is useful for ML training.
Comment: Data-centric Machine Learning Research (DMLR) Workshop at the 40th International Conference on Machine Learning (ICML 2023)
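The class-balancing use case benchmarked above can be illustrated with a minimal SMOTE-style oversampler: synthetic minority samples are generated by interpolating between a minority point and one of its nearest minority-class neighbours. This is a generic sketch of one common balancing technique, not the generators evaluated in the paper; the function name and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_oversample(X_min, n_new, k=3, rng=rng):
    """Generate synthetic minority samples by interpolating each point
    toward one of its k nearest minority-class neighbours (SMOTE-style)."""
    n = len(X_min)
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from point i to every minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                  # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# imbalanced toy data: 50 majority vs 5 minority samples
X_maj = rng.normal(0.0, 1.0, size=(50, 2))
X_min = rng.normal(3.0, 1.0, size=(5, 2))
X_syn = smote_like_oversample(X_min, n_new=45)

# balanced training set: 50 majority, 5 real + 45 synthetic minority
X_bal = np.vstack([X_maj, X_min, X_syn])
y_bal = np.array([0] * 50 + [1] * (5 + 45))
```

Because each synthetic point is a convex combination of two real minority points, the generated samples stay inside the minority class's per-feature range, which is the property that makes this kind of balancing cheap to sanity-check.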
Black-box Coreset Variational Inference
Recent advances in coreset methods have shown that a selection of
representative datapoints can replace massive volumes of data for Bayesian
inference, preserving the relevant statistical information and significantly
accelerating subsequent downstream tasks. Existing variational coreset
constructions rely on either selecting subsets of the observed datapoints, or
jointly performing approximate inference and optimizing pseudodata in the
observed space akin to inducing points methods in Gaussian Processes. So far,
both approaches are limited by complexities in evaluating their objectives for
general purpose models, and require generating samples from a typically
intractable posterior over the coreset throughout inference and testing. In
this work, we present a black-box variational inference framework for coresets
that overcomes these constraints and enables principled application of
variational coresets to intractable models, such as Bayesian neural networks.
We apply our techniques to supervised learning problems, and compare them with
existing approaches in the literature for data summarization and inference.
Comment: NeurIPS 2022
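The weighted-likelihood idea behind variational coresets can be shown in a conjugate toy model (this is a deliberately simplified illustration, not the paper's black-box framework): the posterior over a Gaussian mean is computed from a weighted log-likelihood, and a small random subset with uniform up-weighting stands in for an optimized coreset.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_mean_posterior(x, w, sigma2=1.0, tau2=10.0):
    """Posterior N(mu_n, s2_n) over the mean of a Gaussian with known
    variance sigma2 and prior N(0, tau2), under the weighted log-likelihood
    sum_i w_i * log N(x_i | mu, sigma2); w are the coreset weights."""
    prec = 1.0 / tau2 + np.sum(w) / sigma2
    mu = (np.dot(w, x) / sigma2) / prec
    return mu, 1.0 / prec

x = rng.normal(2.0, 1.0, size=1000)

# full-data posterior: every observation has weight 1
mu_full, s2_full = gaussian_mean_posterior(x, np.ones_like(x))

# crude coreset: 20 random points, uniformly up-weighted so the total
# weight matches the 1000 observations (real constructions optimize w)
idx = rng.choice(len(x), size=20, replace=False)
w = np.zeros_like(x)
w[idx] = len(x) / 20.0
mu_core, s2_core = gaussian_mean_posterior(x, w)
```

Because the total weight is preserved, the coreset posterior has exactly the full-data posterior variance and an approximately correct mean; the variational constructions discussed in the abstract optimize the weights (and, for pseudodata, the locations) instead of sampling them at random.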
Quantifying Privacy Loss of Human Mobility Graph Topology
Human mobility is often represented as a mobility network, or graph, with nodes representing places of significance that an individual visits, such as their home, workplace, and places of social amenity, and edge weights corresponding to probability estimates of movements between these places. Previous research has shown that individuals can be identified by a small number of geolocated nodes in their mobility network, rendering mobility trace anonymization a hard task. In this paper we build on prior work and demonstrate that even when all location and timestamp information is removed from nodes, the graph topology of an individual mobility network is itself often uniquely identifying. Further, we observe that a mobility network is often unique even when only a small number of the most popular nodes and edges are considered. We evaluate our approach using a large dataset of cell-tower location traces from 1,500 smartphone handsets with a mean duration of 430 days. We process the data to derive the top-N places visited by the device in each trace, and find that 93% of traces have a unique top-10 mobility network, and all traces are unique when considering top-15 mobility networks. Since mobility patterns, and therefore mobility networks for an individual, vary over time, we use graph kernel distance functions to determine whether two mobility networks, taken at different points in time, represent the same individual. We then show that our distance metrics, while imperfect predictors, perform significantly better than a random strategy, and therefore our approach represents a significant loss in privacy.
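One way to see how graph topology alone can identify a network, with all location and timestamp labels stripped, is to compute a permutation-invariant fingerprint of each graph. The sketch below uses Weisfeiler-Lehman colour refinement; this is a standard graph-hashing heuristic chosen here for illustration, not the graph-kernel distance functions the paper employs, and the tiny "mobility networks" are assumptions.

```python
import hashlib

def wl_hash(adj, rounds=3):
    """Permutation-invariant fingerprint of an unlabelled graph, given as
    a dict {node: set of neighbours}, via Weisfeiler-Lehman refinement."""
    # initial colours: node degrees (no location or time information)
    colours = {v: str(len(adj[v])) for v in adj}
    for _ in range(rounds):
        # each node's new colour hashes its colour plus the sorted
        # multiset of its neighbours' colours
        colours = {
            v: hashlib.sha256(
                (colours[v] + "|" + ",".join(sorted(colours[u] for u in adj[v])))
                .encode()).hexdigest()
            for v in adj
        }
    # final fingerprint: hash of the sorted colour multiset
    return hashlib.sha256("".join(sorted(colours.values())).encode()).hexdigest()

# two small "mobility networks": a 4-node star vs a 4-node path
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
# the star again with nodes relabelled: same topology, different IDs
star_relab = {7: {4, 5, 6}, 4: {7}, 5: {7}, 6: {7}}
```

Distinct topologies yield distinct fingerprints while relabelled copies collide, which mirrors the abstract's finding that the shape of a top-N network, and not its geographic labels, is what leaks identity.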
Countering Acoustic Adversarial Attacks in Microphone-equipped Smart Home Devices
Deep neural networks (DNNs) continue to demonstrate superior generalization performance in an increasing range of applications, including speech recognition and image understanding. Recent innovations in compression algorithms, design of efficient architectures, and hardware accelerators have prompted a rapid growth in deploying DNNs on mobile and IoT devices to redefine user experiences. Relying on the superior inference quality of DNNs, various voice-enabled devices have started to pervade our everyday lives and are increasingly used for, e.g., opening and closing doors, starting or stopping washing machines, ordering products online, and authenticating monetary transactions. As the popularity of these voice-enabled services increases, so does their risk of being attacked. Recently, DNNs have been shown to be extremely brittle under adversarial attacks, and people with malicious intentions can potentially exploit this vulnerability to compromise DNN-based voice-enabled systems. Although some existing work already highlights the vulnerability of audio models, very little is known about the behaviour of compressed on-device audio models under adversarial attacks. This paper bridges this gap by thoroughly investigating the vulnerabilities of compressed audio DNNs and makes a stride towards making compressed models robust. In particular, we propose a stochastic compression technique that generates compressed models with greater robustness to adversarial attacks. We present an extensive set of evaluations on adversarial vulnerability and robustness of DNNs in two diverse audio recognition tasks, while considering two popular attack algorithms: FGSM and PGD. We found that error rates of conventionally trained audio DNNs under attack can be as high as 100%. Under both white- and black-box attacks, our proposed approach is found to decrease the error rate of DNNs under attack by a large margin.
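FGSM, the first of the two attack algorithms named above, can be sketched against a toy logistic-regression "model" standing in for an audio DNN: the input is perturbed by a small step in the direction of the sign of the loss gradient with respect to the input. The model, its weights, and the dimensions here are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def bce(x, y, w, b):
    """Binary cross-entropy loss of a logistic model on a single input."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))   # sigmoid probability
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(x, y, w, b, eps):
    """Fast Gradient Sign Method: perturb x by eps * sign(dL/dx),
    the single-step input perturbation that increases the loss."""
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    grad_x = (p - y) * w                             # d(BCE)/dx for this model
    return x + eps * np.sign(grad_x)

# toy "model" and clean input
w = rng.normal(size=16)
b = 0.0
x = rng.normal(size=16)
y = 1.0                                              # true label

x_adv = fgsm(x, y, w, b, eps=0.1)
```

For this convex toy the FGSM step provably raises the loss; on real (compressed) audio DNNs the same one-step recipe is only a first-order heuristic, which is why the abstract also evaluates the iterative PGD attack.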